This example summarizes the results of 5 evaluation tasks (gpqa_diamond, aime2024, mmlu_pro, cybench, and swe_bench) across 4 models (OpenAI o4-mini and o3, and Anthropic Claude Sonnet 3.7 and 4.0).
Bar Chart
We start with a simple bar chart faceted by evaluation task:
1. Facet the x-axis (i.e. create multiple groups of bars) by task name.
2. We don't need explicit "model" or "task_name" axis labels as they are obvious from context. We also don't need ticks because the fill color and legend already identify the models.
3. Ensure that the y-axis shows the full range of scores (by default it caps at the maximum).
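The faceting described above amounts to grouping scores first by task and then by model within each task. As a rough, library-agnostic sketch of that grouping (the row layout and score values here are illustrative assumptions, not the example's actual dataset):

```python
from collections import defaultdict

# Hypothetical summary rows: (task_name, model, score) -- illustrative values only.
rows = [
    ("gpqa_diamond", "o4-mini", 0.70),
    ("gpqa_diamond", "o3", 0.80),
    ("aime2024", "o4-mini", 0.60),
    ("aime2024", "o3", 0.85),
]

# Group scores by task (the facet), then by model (the bars within each facet).
facets: dict[str, dict[str, float]] = defaultdict(dict)
for task, model, score in rows:
    facets[task][model] = score

for task, bars in facets.items():
    print(task, bars)
```

Each outer key corresponds to one facet (group of bars) on the x-axis; each inner key becomes one fill-colored bar within it.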
Confidence Interval
Here, we add a confidence interval for each reported score by adding a rule_x() mark. Note that we compute the confidence interval range dynamically using a sql() transform:
1. Dynamically compute each side of the confidence interval using a sql() transform.
2. Draw the confidence interval using a rule_x() mark.
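The sql() transform computes the interval endpoints in the database; the arithmetic itself is the familiar normal-approximation interval, score ± z · stderr. A minimal Python sketch of the same computation (the stderr column and the z value of 1.96 are assumptions for illustration):

```python
def conf_interval(score: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval: (score - z*stderr, score + z*stderr)."""
    return (score - z * stderr, score + z * stderr)

# A score of 0.75 with a standard error of 0.02:
lo, hi = conf_interval(0.75, 0.02)
print(round(lo, 4), round(hi, 4))  # 0.7108 0.7892
```

The two endpoints are what the rule_x() mark spans between for each bar.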
Filtering
Here we add filtering inputs to enable viewing a single model and/or a single task at a time. We use the hconcat() and vconcat() functions to lay out the inputs and the plot.
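Conceptually, each filter input restricts the plotted rows to those matching the selected value, with "all" leaving that dimension unrestricted. A stdlib sketch of that predicate (the field names and rows are assumptions for illustration):

```python
# Hypothetical rows: dicts with "model", "task_name", and "score" keys.
rows = [
    {"model": "o3", "task_name": "cybench", "score": 0.4},
    {"model": "o4-mini", "task_name": "cybench", "score": 0.3},
    {"model": "o3", "task_name": "mmlu_pro", "score": 0.8},
]

def filter_rows(rows, model=None, task_name=None):
    """Keep rows matching the selected model and/or task; None means 'all'."""
    return [
        r for r in rows
        if (model is None or r["model"] == model)
        and (task_name is None or r["task_name"] == task_name)
    ]

print(filter_rows(rows, model="o3"))           # both o3 rows
print(filter_rows(rows, task_name="cybench"))  # both cybench rows
```

Selecting values in both inputs composes the two conditions, narrowing the plot to a single model/task combination.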